Fix issues with initial start time and restarts in the Prometheus receiver. #598
Conversation
return true
}

func (ma *MetricsAdjuster) adjustPoints(metricType metricspb.MetricDescriptor_Type, current, initial []*metricspb.Point) bool {
You can probably get rid of adjustPoints. In adjustTimeseries:
func (ma *MetricsAdjuster) adjustTimeseries(metricType metricspb.MetricDescriptor_Type, current, initial *metricspb.TimeSeries) bool {
    cPoints := current.GetPoints()
    iPoints := initial.GetPoints()
    if len(cPoints) != 1 || len(iPoints) != 1 {
        // log if needed.
        return false
    }
    if !ma.adjustPoint(metricType, cPoints[0], iPoints[0]) {
        return false
    }
    current.StartTimestamp = initial.StartTimestamp
    return true
}
I've simplified it - ptal.
// note: sum of squared deviation not currently supported
initialDist := initial.GetDistributionValue()
currentDist := current.GetDistributionValue()
if currentDist.Count < initialDist.Count {
It should check whether either the count or the sum is less than the previous value.
Done
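For reference, a minimal sketch of the check being discussed, assuming the DistributionValue fields (Count, Sum) from opencensus-proto; the package and helper name are hypothetical and this is not the exact code from the PR:

package adjuster

import (
    metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
)

// isDistributionReset reports whether a cumulative distribution appears to have
// restarted: either the count or the sum moved backwards relative to the
// initial (reference) point.
func isDistributionReset(current, initial *metricspb.Point) bool {
    currentDist := current.GetDistributionValue()
    initialDist := initial.GetDistributionValue()
    return currentDist.Count < initialDist.Count || currentDist.Sum < initialDist.Sum
}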
func (ma *MetricsAdjuster) adjustBuckets(current, initial []*metricspb.DistributionValue_Bucket) bool {
    if len(current) != len(initial) {
        // this shouldn't happen
        ma.logger.Info("len(current buckets) != len(initial buckets)", len(current), len(initial))
should you change this to Error?
I generally avoid using Error unless something major is going on, but that's based on Google prod convention - what is the convention for the OC agent on when Error is appropriate?
Oops, looks like this PR is addressing the same issue as #597, which I submitted earlier. I had actually tried a similar approach at the very beginning by caching the first generated TimeSeries; however, it's not going to work properly in some cases, including:
From an Appender's perspective, only some of the labels are present in the first run; then in the 2nd run, you get a new timeseries with the previously empty labels filled in.
Now the timeseries will have all 4 labels in the 2nd run, and it will be challenging to link them back. Not only the label values but also the label order is important.
You might want to take a look at the PR I have submitted and take the work from there. I have also added quite a lot of tests trying to capture all these corner cases, as well as a redesigned end-to-end test that produces more stable results and simulates edge cases like the remote server not responding properly.
#597 covers more cases than #598. However, in #598 the MetricsAdjuster is independent of scraping, which is useful for another Envoy Prometheus receiver that converts from protobuf (instead of text) to opencensus-proto.
@fivesheep - thank you for the detailed response and my apologies for the delay in responding. As @rghetia mentioned, I took this approach because we'd like to reuse the logic across both the text and proto versions of the prometheus receiver. In response to your particular comments:
@dinooliva it does make sense to have a generic solution that is able to adjust incoming metric values for different Prometheus protocols. I am not sure if it's possible to move it one more layer up into a more generic solution for any metrics receiver that needs to adjust values, with a simple interface (see the sketch below) that could be made an option for the metric receiver factory. Other than that, in #597 I have also refactored the metricBuilder and added a layer called metric family, which makes the code cleaner and easier to understand; you might also want to take a look at it.
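The interface snippet from the original comment isn't preserved above; as a rough illustration only, a hypothetical receiver-agnostic adjuster interface along those lines might look like this (the package, interface name, and signature are all assumptions, not code from either PR):

package adjuster

import (
    metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
)

// MetricsAdjuster is a hypothetical hook that a receiver factory could accept
// as an option: given the metrics produced by one scrape/translation pass,
// return them with start timestamps and cumulative values adjusted.
type MetricsAdjuster interface {
    AdjustMetrics(metrics []*metricspb.Metric) []*metricspb.Metric
}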
@fivesheep - I spent some time looking over your solution and your notes. I think it may be best to go with a combined solution. My module is strictly for adjusting start times and values based on discovered initial timeseries and resets. Dealing with newly discovered labels and empty histograms/summaries isn't really something the metrics adjuster should handle (I just have to make sure that the signature calculation works appropriately for empty label values; see the sketch below), but those issues do need to be dealt with in the metricsbuilder, as you have done. I plan to add a couple of commits that deal with issues 1 & 2. Issue 3, I think, should be handled by the metricsbuilder, and issue 4 can also be dealt with using your solution.
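To illustrate the point about empty label values and label ordering (a sketch only; timeseriesSignature is a hypothetical helper, not the function from the PR): the signature calculation has to treat empty label values consistently and be insensitive to label order, so that the same label set always yields the same cache key.

package adjuster

import (
    "sort"
    "strings"

    metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
)

// timeseriesSignature builds a stable key for a timeseries from the metric's
// label keys and the timeseries' label values. Keys are paired with their
// values (empty values included) and sorted so that label order does not
// affect the result.
func timeseriesSignature(keys []*metricspb.LabelKey, values []*metricspb.LabelValue) string {
    labels := make([]string, 0, len(keys))
    for i, key := range keys {
        value := ""
        if i < len(values) && values[i] != nil {
            value = values[i].Value
        }
        labels = append(labels, key.Key+"="+value)
    }
    sort.Strings(labels)
    return strings.Join(labels, ",")
}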
Codecov Report
@@            Coverage Diff             @@
##           master     #598      +/-   ##
==========================================
+ Coverage   68.61%   68.89%   +0.28%
==========================================
  Files          91       92       +1
  Lines        5939     6064     +125
==========================================
+ Hits         4075     4178     +103
- Misses       1650     1668      +18
- Partials      214      218       +4
Continue to review full report at Codecov.
PTAL - this PR now addresses the issue with restarts and adds more unit tests. Once this PR has been merged, I will submit another PR to deal with cleaning up the cache (I don't want this PR to get too complicated).
couple of nits. LGTM otherwise.
latestValue := initialValue
if initial != latest {
    latestValue += latest.GetDoubleValue()
}
I find the word 'latest' confusing. Maybe 'previous' would be better.
Done.
runScript(t, script)
}

func Test_multiMetrics(t *testing.T) {
The MultiMetrics test covers everything that the individual metric tests cover. Maybe just keep the multi-metric test.
I considered this as well - the individual tests are easier for debugging purposes and for documenting what should happen for the different aggregation types.
Unless you object, I propose keeping them for now - if maintaining them is problematic, we can remove them subsequently.
@songy23 - ptal
One minor question, otherwise LGTM.
}

// returns true if at least one of the metric's timeseries was adjusted, and false if all of the timeseries are an initial occurrence or a reset.
// Types of metrics returned by prometheus that are supported:
How about gauge and cumulative int64 values?
These kinds of metrics aren't generated by the Prometheus -> OC Metrics translation, so I didn't add support for them.
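For context, a sketch of the type handling being discussed; the helper name and exact case list are assumptions for illustration, not the code merged in this PR:

package adjuster

import (
    metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
)

// isAdjustableType reports whether the adjuster needs to track a metric type.
// Gauges carry no meaningful start time, and the int64 variants are not
// produced by the Prometheus -> OC Metrics translation, so in this sketch only
// cumulative doubles, cumulative distributions, and summaries are considered.
func isAdjustableType(t metricspb.MetricDescriptor_Type) bool {
    switch t {
    case metricspb.MetricDescriptor_CUMULATIVE_DOUBLE,
        metricspb.MetricDescriptor_CUMULATIVE_DISTRIBUTION,
        metricspb.MetricDescriptor_SUMMARY:
        return true
    default:
        // GAUGE_* and the INT64 variants are left untouched.
        return false
    }
}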
We've merged this code, which resolves issue 1 of the 4 issues that you originally noted. Of the other 3 issues: issue 2 needs to be handled by the metrics adjuster, and I'm working on a solution to it. I had been planning to clean up at the job level rather than at the timeseries level, but that doesn't seem to work for your use case (cadvisor), so I will look into handling individual timeseries. Issue 3 is separate from the metrics adjuster and your code already deals with it. Issue 4 is also separate from the metrics adjuster and, again, your code deals with it. Would you be able to update your PR to remove the issue 1 and issue 2 related code while keeping the rest?
@dinooliva sure, will do. Are we still going to have a release with these two PRs, or will the release only happen in the new project?
@fivesheep I think it would make sense to make a new release with these changes, but I'll have to consult with the project owners to see what their current process is. @songy23 - any thoughts?
We can still make patch releases for small improvements, bug fixes, etc. @flands Could you help?
There are a few pending PRs that started before we made the announcement that we were in "maintenance mode". Assuming that this can wait a bit, let's close those before cutting a new release.
Code is complete - ptal. Currently working on updating/adding unit tests.